Project 2: Supervised Learning

This project comprises 2 parts:

Part 1: Healthcare

PROJECT OBJECTIVE: Demonstrate the ability to fetch, process and leverage data to generate useful predictions by training Supervised Learning algorithms. KNN is used, as suggested in the question.

DATA DESCRIPTION: The data consists of biomechanical features of patients, recorded according to their current condition. Each patient is represented in the data set by six biomechanical attributes derived from the shape and orientation of the body part affected by the condition. The seventh column, Class, is the label.

  1. P_incidence
  2. P_tilt
  3. L_angle
  4. S_slope
  5. P_radius
  6. S_degree
  7. Class

1. Import and warehouse data:

Observation:

2. Data cleansing:

Observation:

Observation:

3. Data analysis & visualisation:

Observations:

Check for outliers that might hamper the model, and remove them.

Observation: P_tilt and S_degree are the only attributes with negative values.

The record in row 215 produces outliers in several fields: P_incidence, S_slope and S_degree. Hence we delete this record.
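The rule used to flag outliers is not stated here. A minimal sketch, assuming the common 1.5 × IQR rule and a small made-up frame (`iqr_outlier_mask` is a hypothetical helper, and the values are not the project's data), shows how a record that is an outlier in several fields could be found and dropped:

```python
import pandas as pd


def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame: True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] for its column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < q1 - k * iqr) | (df > q3 + k * iqr)


# Made-up values for illustration only; the last row is deliberately extreme.
toy = pd.DataFrame({
    "P_incidence": [40.0, 45.0, 50.0, 55.0, 60.0, 130.0],
    "S_slope": [30.0, 32.0, 35.0, 38.0, 40.0, 120.0],
    "S_degree": [1.0, 2.0, 3.0, 4.0, 5.0, 400.0],
})

mask = iqr_outlier_mask(toy)
outliers_per_row = mask.sum(axis=1)  # number of flagged fields per record

# Drop records flagged in more than one field (threshold is an assumption).
cleaned = toy.drop(index=outliers_per_row[outliers_per_row > 1].index)
```

Counting flags per row, rather than per column, makes it easy to spot a single record that is responsible for outliers in several fields at once.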

Observations:

4. Data pre-processing:

First, we convert the object-typed variable to a categorical variable.
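A minimal sketch of this conversion with pandas, assuming the object-typed column is `Class`; the label values below are made up for illustration, since the real ones are not listed in this text:

```python
import pandas as pd

# Made-up labels; the actual Class values are not shown in this report.
toy = pd.DataFrame({"Class": ["normal", "abnormal", "normal", "normal"]})

# object -> category: pandas stores each distinct label once and keeps integer codes.
toy["Class"] = toy["Class"].astype("category")

categories = list(toy["Class"].cat.categories)
codes = toy["Class"].cat.codes.tolist()
```

The integer codes follow the sorted order of the categories, which is worth checking before interpreting a model's predicted class indices.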

We use StandardScaler to standardize the values of each column. This brings the input variables, which may be on different scales in their raw form, onto the same scale.
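As a small illustration of what standardization does, the sketch below applies scikit-learn's `StandardScaler` to a made-up matrix whose two columns differ in scale by a factor of 100. The data is hypothetical; only the scaler call reflects the step described above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z = (x - column mean) / column std

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
```

After scaling, every column has mean 0 and standard deviation 1, so distance-based models such as KNN are not dominated by the column with the largest raw values. When a train/test split is involved, the scaler is usually fitted on the training split only and then applied to the test split.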

Observation: For target balancing

5. Model training, testing and tuning:

Observation:

Observations:

One more observation that we can make here

Observation:

6. Conclusion and improvisation:

  1. Observations:

    • The best KNN model fit was obtained with 5 neighbours.
    • The accuracy of the model came out to be 82.8%.
    • We built our model based on the correlation table, keeping the variables that carry the most useful information.
    • We also compared 2 KNN models and selected the better fit.
  2. Suggestion:

    • Row P_result was not properly maintained and did not produce much information.
    • More variables with proper headers would help in understanding the data.
    • A proper description of the data fields would also be of great help in understanding the data.
    • Multiple outliers present in each data set could have been avoided.
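The KNN comparison summarised in the observations can be sketched roughly as follows. This is not the project's code: the data is synthetic (six features, mirroring the six attributes), and the candidate neighbour values and split ratio are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with six features; not the patient dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Compare several neighbour counts (candidate values are illustrative).
scores = {}
for k in (3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    scores[k] = model.score(X_test_s, y_test)  # accuracy on the held-out split

best_k = max(scores, key=scores.get)
```

Picking the neighbour count on the test split, as this short sketch does, gives an optimistic score; a validation split or cross-validation is the safer way to choose it.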
-----------END OF PROJECT 1----------

Part 2: Banking and Finance

• CONTEXT: Bank X is undergoing a massive digital transformation across all its departments. The bank has a growing customer base, the majority of whom are liability customers (depositors) rather than borrowers (asset customers). The bank is interested in rapidly expanding its borrower base to bring in more business via loan interest. A campaign that the bank ran last quarter showed an average single-digit conversion rate. With digital transformation at the core of the business strategy, the marketing department wants to devise effective campaigns with better target marketing, raising the conversion ratio to double digits on the same budget as the last campaign.

  1. ID: Customer ID
  2. Age: Customer’s approximate age.
  3. CustomerSince: Customer of the bank since. [unit is masked]
  4. HighestSpend: Customer’s highest spend so far in one transaction. [unit is masked]
  5. ZipCode: Customer’s zip code.
  6. HiddenScore: A score associated to the customer which is masked by the bank as an IP.
  7. MonthlyAverageSpend: Customer’s monthly average spend so far. [unit is masked]
  8. Level: A level associated to the customer which is masked by the bank as an IP.
  9. Mortgage: Customer’s mortgage. [unit is masked]
  10. Security: Customer’s security asset with the bank. [unit is masked]
  11. FixedDepositAccount: Customer’s fixed deposit account with the bank. [unit is masked]
  12. InternetBanking: if the customer uses internet banking.
  13. CreditCard: if the customer uses bank’s credit card.
  14. LoanOnCard: if the customer has a loan on credit card.

PROJECT OBJECTIVE:

1. Import and warehouse data:

Observations:

  1. The data is split across 2 datasets of 5000 rows each.
  2. One set contains 8 variables while the other contains 7 variables; ID is common to both.
  3. ID is a unique value in each dataset.
  4. The 2 sets should be joined using ID as the key.
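The join described in the last observation can be sketched with `pandas.DataFrame.merge`. The two frames below are tiny made-up stand-ins for the 5000-row files, and the way the columns are divided between them is an assumption.

```python
import pandas as pd

# Made-up stand-ins for the two files; the real column split is not shown here.
part_a = pd.DataFrame({"ID": [1, 2, 3], "Age": [25, 40, 33]})
part_b = pd.DataFrame({"ID": [3, 1, 2], "Mortgage": [0, 155, 0]})

# Confirm ID is unique in each set before joining.
ids_unique = part_a["ID"].is_unique and part_b["ID"].is_unique

# validate="one_to_one" makes pandas raise if either side has duplicate keys.
merged = part_a.merge(part_b, on="ID", how="inner", validate="one_to_one")
```

With unique IDs on both sides, the merged frame has one row per customer and the columns of both sets, with ID appearing once.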

Here, we can observe that the field 'LoanOnCard' has 20 null values. We'll find which records those are in the exploration below.
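A short sketch of how such null values can be counted and located with pandas, using a made-up frame. How the project handled the 20 records is not stated here, so the final drop is only one possible option and is labeled as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not the bank data.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "CreditCard": [0, 1, 0, 1, 1],
    "LoanOnCard": [0.0, np.nan, 1.0, np.nan, 0.0],
})

null_counts = toy.isnull().sum()              # nulls per column
null_rows = toy[toy["LoanOnCard"].isnull()]   # the records with a missing target

# One option (an assumption): drop rows whose target is unknown.
labelled = toy.dropna(subset=["LoanOnCard"]).copy()
labelled["LoanOnCard"] = labelled["LoanOnCard"].astype(int)
```

Because LoanOnCard is the value being predicted, rows without it cannot be used for supervised training, which is why dropping them is a common choice.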

2. Data cleansing:

Observations: These observations concern the dataset, its variables and how they are divided.

3. Data analysis & visualisation:

Observations:

Observation:

Observations:

Observations:

4. Data pre-processing:

Here, we can see the split of the y data into y_train and y_test. The number of 0 cases is far greater than the number of 1 cases. Hence, we'll apply SMOTE to balance the classes of the categorical target.

We can now see that both classes are equally represented. Balancing is done.
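The SMOTE call used in the project is not shown; the usual route is the `SMOTE` class from the imbalanced-learn library with its `fit_resample` method. To illustrate the idea behind it, here is a simplified NumPy sketch (not the library implementation): each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. `smote_sketch` and the data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)  # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Made-up imbalanced training data: 90 samples of class 0, 10 of class 1.
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (90, 3)), rng.normal(2, 1, (10, 3))])
y_train = np.array([0] * 90 + [1] * 10)

X_min = X_train[y_train == 1]
X_new = smote_sketch(X_min, n_new=80)  # bring class 1 up to 90 samples

X_bal = np.vstack([X_train, X_new])
y_bal = np.concatenate([y_train, np.ones(len(X_new), dtype=int)])
```

Oversampling is normally applied to the training split only, so that the test split keeps the original class ratio and the reported scores reflect real-world conditions.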

5. Model training, testing and tuning:

Observations:

Observations:

In conclusion, the best-fit model for our case is the Logistic Regression model, with an accuracy of 88%.
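A comparison of this kind can be sketched as below. The data is synthetic and imbalanced, `GaussianNB` is an assumed choice of Naive Bayes variant (the one used in the project is not stated), and the scores it produces are unrelated to the 88% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced stand-in data; not the bank dataset.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),  # assumed Naive Bayes variant
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, pred_train),
        "test_accuracy": accuracy_score(y_test, pred_test),
        "test_recall": recall_score(y_test, pred_test),  # recall on class 1
    }
```

Comparing train and test accuracy side by side gives a rough indication of overfitting, while recall on the positive class shows how many actual borrowers each model finds, which accuracy alone can hide on imbalanced data.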

6. Conclusion and improvisation:

  1. Observations
    • Of Logistic Regression and Naive Bayes, the best-fit model in our case is Logistic Regression.
    • Logistic Regression is less affected by overfitting than Naive Bayes, and it also has good accuracy and recall values.
    • Sampling helped greatly in making correct predictions. Without sampling, the model performed very well on the train data, but problems would occur with the test/unknown data.
  2. Suggestions
    • One major drawback of the data: customers who do not have credit cards were marked for Loan On Card. This loan could be on a debit card, or just a general loan, but this was not relevant for our test.
    • An equal amount of data for both classes should be gathered for a more robust and accurate model.
    • Point of Security: Customers with a card have not opted for security, or it has not been provided. This could be one of the points targeted by the advertising company in the next campaign to increase borrowers/card holders.
    • It would also be helpful to maintain data on Gender and Income so that we can understand the diversity of the customers and their behaviours.